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Abstract 

A 3 -dimensional GPU Poisson solver is developed for all possible combinations of free and periodic boundary con- 
ditions along the three directions. It is benchmarked for various grid sizes and different BCs and a significant perfor- 
mance gain is observed for problems including one or more free BCs. The GPU Poisson solver is also benchmarked 
against two different CPU implementations of the same method and a significant amount of acceleration of the com- 
putation is observed with the GPU version. 
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1. Introduction 

In the context of electronic structure calculations of 
molecules, wires, surfaces and solids, it is necessary to 
solve Poisson's equation with various boundary condi- 
tions (BCs), namely free BCs, wire BCs (periodic along 
one axis, free along two axes), surface BCs (periodic 
along two axes, free along one axis) and fully periodic 
boundary conditions. 

A common method for solving the Poisson's equation 



v 2 y = P 



(D 



exploits the Fast Fourier Transform (FFT) algorithm in 
order to transform the input density p into the Fourier 
space. The desired electrostatic potential V can then 
easily be found in Fourier space and it is converted back 
to the real space by an inverse FFT. This FFT based 
solution is preferred for large data sets because of its 
O(NLogN) scaling due to the FFTs, N being the num- 
ber of grid points for the discrete representation of the 
density p and the solution V. This method is also suit- 
able for GPU computation since FFTs are efficiently 
implemented for GPUs by various vendors and groups 
OJflSli- Currently the NVIDIA CUFFT library sup- 
ports ID, 2D and 3D FFTs both for real and complex 
input data in single and double precision. 



"Corresponding author 

Email address: Stefan. goedeckerOunibas . ch (Stefan 
Goedecker) 



In current GPU architectures, the data has to be trans- 
ferred to the device memory prior to the computation 
and the result has to be transferred back to the system 
memory to be used in the succeeding CPU parts of the 
code. These data transfer times are comparable to the 
GPU computation times for the Poisson solver case, re- 
ducing the possible gains of GPUs for such calculations. 
The situation is less problematic for the case of free BCs 
since the transferred data size for such problems is half 
of the FFT size for each dimension. For a 3D input data, 
which is usually encountered in physical problems, the 
transferred data size is reduced to one eighth of the total 
FFT size, making the problem suitable for GPU com- 
putation. As the number of axis with free boundary 
conditions decreases when going from fully free bound- 
ary conditions to fully periodic boundary conditions, the 
data size increases to a quarter, a half and finally the full 
FFT data set size and the transfer times dominate more 
and more. Additionally, 3D FFT computation time for 
free BC problems can be reduced by using customized 
algorithms since initially only one eighth of the 3D FFT 
size is non-zero Q3Q . 

In this work, we describe a GPU Poisson solver 
which uses customized 3D FFTs to increase the per- 
formance for free BCs. This customized GPU Poisson 
solver is benchmarked for various grid sizes for fully 
free, wire and surface BCs to see the possible gain in 
the presence of free BCs. It is also benchmarked against 
a CPU implementation of the same method which uses 
the FFTW library |0] for ID batched FFTs. The de- 
veloped GPU Poisson was implemented in the BigDFT 
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wavelet based electronic structure package flU-Hfc and 
benchmarked against the current CPU Poisson solver of 
BigDFT HI for fully free BCs. 

2. Poisson's Equation 

In the discretized integral form, the Poisson's equa- 
tion (Eq. Q]) for a periodic input density becomes a con- 
volution of the form: 



V(r) = J] K p (r - r')p(r') 



(2) 



where K p (r - r') is the Green's function associated with 
the periodic BCs. This equation can be solved easily in 
Fourier space just with a multiplication operation: 



y F (k) = ^(k) P F (k), 



(3) 



where the superscript F denotes the Fourier transform 
of the corresponding quantity which can be calculated 
by the FFT algorithm. Once the potential is calculated 
in Fourier space with the above formula, it can be con- 
verted back to the real space with an inverse FFT op- 
eration. The advantage of this strategy is that the con- 
volution in Eq. [2]is calculated in O(NLogN) operations 
instead of the 0(N 2 ) operations of the direct calculation, 
where N = N x N y N z is the total number of grid points in 
the 3 -dimensional input density. 

Poisson's equation with free BCs can be treated simi- 
larly by using the free boundary Green's function Kf in- 
stead of K p in Eq. [3] Since the use of a FFT implies pe- 
riodicity, free boundary conditions can only be achieved 
by zero padding to the input density, doubling its size 
in each dimension 1 11, 12, Fill . In this way the wrap 
arounds in FFTs can be avoided at the cost of enlarging 
the problem size. For a mixed BCs problem, the zero 
paddings should be applied for all the dimensions hav- 
ing free BCs. Examples of these mixed BCs problems 
are surfaces with one free dimension and wires with two 
free dimensions for which the associated Green's func- 
tions have to be used in the calculation 1 14]. 

3. Customized GPU Poisson Solver 

In our GPU Poisson solver, we implemented the 
method used in the Poisson solver of the BigDFT elec- 
tronic structure package lioll . For free BC dimensions, 
the input density is zero padded doubling the data size 
for that dimension. Direct application of 3D FFT li- 
braries for such a zero padded input data is not efficient 
since these algorithms work on the whole range regard- 
less of the data being zero or not. In our developments, 



a customized 3D FFT is used for this problem which 
avoids the calculation of ID FFTs over the zero padded 
parts of the input density. 

Basically, a 3D FFT can be calculated in three steps 
each having ID FFTs of all the rows in a separate di- 
mension. In our customized 3D FFT, zero padded parts 
of the data are identified for each dimension and the un- 
necessary calculations over these parts are avoided. To 
understand the algorithm let us denote the data size in 
three dimensions as N x ,N y and N z and the BCs as B x , 
By and B z having values 1 for periodic and 2 for free 
BCs. Then the FFT size for each dimension is given 
as Sj = Nj x Bp In the first step, ID FFTs of length 
S x should be calculated for iV v x N z rows instead of the 
full range of S y X S z rows. For the second dimension, 
ID FFTs of length S y should be calculated for S x x N z 
rows since non-zero elements emerge in the zero padded 
parts of the x-dimension after the first step. In the third 
step the whole size of S x x S y rows should be pro- 
cessed for S z length ID FFTs. Therefore, for a fully 
free boundary problem, the total number of ID FFT op- 
erations is reduced to 7/12 (approximately 58%) of the 
direct calculation of the whole zero padded range with 
a 3D FFT library such as CUFFT3D. The inverse of the 
zero padding operation is carried out for the inverse 3D 
FFT after the multiplication in Eq. [3] as compactifica- 
tions of the results by taking out the unnecessary parts. 
Therefore, the same amount of reduction in the com- 
putation time is also valid for the inverse 3D FFT. For 
the heterogeneous BC cases (surface and wire BCs), the 
reduction amount of ID FFT operations is influenced 
by the choice of the axis along which free dimensions 
are applied (one axis for surface BCs and two axes for 
wire BCs). If the free dimension(s) is/are chosen as the 
last dimension(s) of the forward transform then the zero 
paddings can be postponed and the performance of the 
method increases. In the inverse transform, free BCs 
dimension(s) are processed first since the order of axes 
reverses for this case. For the wire BCs, if the periodic 
dimension is chosen as the first dimension of the for- 
ward 3D FFT then the reduction of ID FFT operations 
is the same as the fully free BCs discussed above. The 
total performance for wire BCs is slightly better than the 
fully free BCs of same zero padded size since for the 
wire case there is no need for the initial zero padding 
operation before the FFTs in the first dimension (also 
the final compactification in the inverse FFT). For the 
surface BCs, by choosing the free boundary dimension 
as the last dimension of the forward 3D FFT, the number 
of the ID FFT operations reduces to 2/3 (approximately 
66%) of the fully periodic case. 

In the calculation of a 3D FFT, a transposition oper- 
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ation usually follows each of the three steps described 
above in order to conserve the desired access pattern for 
optimal performance fl5ll . Such transpositions are also 
used in our customized 3D GPU FFT in order to pre- 
serve the coalesced access pattern to the global memory 
for the computations in three separate dimensions. If 
the zero paddings for the free BCs treatment are applied 
at the beginning of the calculation for all free boundary 
dimensions then the zero parts of the data should also be 
processed during these transpositions. In order to avoid 
these unnecessary operations, zero padding operations 
are applied for dimensions having free BCs just before 
the FFTs in that dimensions by a spread operation which 
doubles the data length for each row and sets the first 
half to zero. (In the inverse FFT, compactification op- 
erations are carried out after the FFTs in each dimen- 
sion having free BCs as throwing out the second half of 
each row.) In this strategy the transpositions are applied 
with reduced data sets and additionally the zero padding 
spread GPU kernel (also the compactification kernel in 
inverse FFT) can be merged with the preceding trans- 
pose GPU kernel (except for the first dimension which 
does not have the transposition) to reduce the number 
of read/write operations to the GPU global memory. As 
an example, for fully free BCs the first transposition op- 
eration between the first two dimensions of the forward 
transform is carried out with a data size which is a quar- 
ter of the total zero padded FFT size since the second 
and the third dimensions are not zero padded at that mo- 
ment. The second transposition between the second and 
the third dimensions is carried out with the half of the 
total zero padded FFT size since the last dimension is 
not zero padded yet. The final transposition after the 
third dimension should be operated on the full size but 
this transposition can be eliminated completely with a 
transposition of the associated kernel of the convolution 
in Eq. [3] which is carried out only once in the multiple 
application of the Poisson solver which is usually the 
case in realistic applications. 

Since the input density for the Poisson solver is al- 
ways a real valued quantity it would be advantageous to 
use real to complex FFTs for the first dimension of the 
forward 3D FFT. Because of the property F^-n = F* n of 
a real to complex FFT it is enough to keep first S x /2 + 1 
terms of the results after the ID FFTs in the first di- 
mension and the succeeding ID FFTs in the remaining 
dimensions are carried out over this reduced data set of 
size (S x /2+l)xS y xS z . It is not necessary to unfold the 
result at the end of the FFT for the Poisson solver case 
since an inverse FFT follows the forward FFT yielding 
a real potential at the end. The complex to real ID FFTs 
for the last dimension of this inverse transform has the 



same symmetry mentioned above. 

Separate steps of the customized GPU algorithm is 
given below where bold text is used to indicate the 
GPU kernels. Processed data size is indicated for each 
ID FFT step. The transpose or transpose +spread 
(compactification+transpose for inverse FFT) kernels 
process same size of data as the preceding ID FFTs: 

• if(B x is free) spread 'JC 

• lDFFTJi (N, X N, rows, length SJ (R2C) 

• if (By is free) transpose +spread_Y 
else transpose 

• 1DFFT.Y ((SJ2 + 1) X N : rows, length S y ) 

• if(B~ is free) transpose +spread^ 
else transpose 

• 1DFFTJ. ((SJ2 + 1) X S y rows, length S -J 

• convolution kernel multiplication 

• inverveJDFFTJZ ({SJ2 + 1) X S y rows, length S z ) 

• if(B. is free) compactificationJL+transpose 
else transpose 

• inverse JDFFI _Y ((S _J2 +1)XJV. rows, length S y ) 

• if (B y is free) compactification J/+transpose 
else transpose 

• inverse JDFFTJC (N y X N z rows, length S x ) (C2R) 

• if (B x is free) compactification JC 

In the implementation of the method we used the 
NVIDIA CUFFT1D library for evaluating batched 
ID FFTs and the CUDA language for transpositions 
and spread operations. The method was tested with 
NVIDIA CUDA Toolkit version 4. 1 on a NVIDIA Tesla 
series M2090 GPU using double precision. An OpenCL 
implementation of the method using custom ID FFT 
GPU kernels is also in progress. 

4. Results 

Benchmark calculations for fully periodic, surface, 
wire and fully free BCs are carried out by varying the 
grid size. Computation times for the fully periodic 
case not including the data transfers and computation 
times of the cases having one or more free boundary 
dimensions relative to the fully periodic case are given 
in Figure Q] for different grid sizes. In the figure, grid 
sizes are always given as the zero padded FFT sizes 
(S x x S v x S z ) instead of the input density sizes in order 
to keep FFT sizes constant for different BCs, yielding a 
better comparison of the results. The computation time 
percentages for the free BCs case and the wire BCs case 
are close to the calculated percentage of 58% for the 
amount of ID FFTs except for the small grid dimen- 
sions. Small deviations from this value are due to the 
spread, transposition and convolution kernel multiplica- 
tion operations which become more important for small 
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grid dimensions. For the surface BCs, again the compu- 
tation time percentages are close to the calculated per- 
centage of 66% for the amount of ID FFTs except for 
the small grid dimensions and the variations can be ex- 
plained with the same reasoning as the free and wire 
BCs cases. 



Performance of the customized GPU Poisson solver 
in the presence of free dimensions is compared also 
with a CPU implementation of the same method where 
FFTW library is used for batched ID FFTs and CUDA 
kernels for spread and transposition operations are re- 
placed with corresponding CPU functions. An Intel 17 
930 2.8 GHz Quad Core processor is used in these CPU 
calculations and the comparison is made against single 
core and also against 4 cores using OpenMP paralleliza- 
tion. Computation times of the GPU computation with 
and without data transfers are given for fully free, sur- 
face and wire BCs in Figure [2] relative to single core 
CPU computation times and in Figure [3] relative to 4 
cores CPU computation times. Grid sizes are given as 
the zero padded FFT sizes (S x xS v x S ,) also for these 
comparisons. For the single core case, computation 
time ratios between the CPU implementation and the 
GPU implementation vary in between 7.798 and 24.389 
depending on the grid size and BCs. For the 4 cores 
CPU case the range of this ratio is between 2.769 and 
1 1 .090. When the transfer times of sending the input 
density to the device memory and getting the resulted 
potential from the device memory are also included in 
the comparison, the computation time ratios of the CPU 
case to the GPU case vary in between 2.879 and 15.602 
for single core case and this ratio is between 1.154 and 
6.815 for 4 cores CPU case. 



In order to test the developed GPU Poisson solver in 
a real application, we implemented it in the BigDFT 
electronic structure package and compared its perfor- 
mance with the CPU Poisson solver of the BigDFT for 
fully free BCs which is faster than the CPU implemen- 
tation using FFTW for fully free BCs since it takes ad- 
vantage of merging transpose and kernel multiplication 
operations with customized ID FFTs. These compari- 
son results are given in Figure [4] and Figure [5] for single 
core CPU and 4 cores CPU respectively. Including data 
transfer times, for the single core case computation time 
ratios between the CPU implementation and the GPU 
implementation vary in between 4.967 and 14.922 de- 
pending on the grid size. For the 4 cores CPU case the 
range of this ratio is between 1.610 and 4.189. 



5. Conclusions 

We have developed a customized GPU Poisson solver 
which is generally applicable to homogeneous and 
mixed BCs. Benchmark calculations show significant 
reduction of computation times compared to the fully 
periodic BCs case for the cases having one or more free 
BCs. The largest reduction of computation times are 
observed in the case of wire BCs where the periodic di- 
mension is chosen as the first dimension of the forward 
3D FFT. The reductions of computation times compared 
to the fully periodic case are in consistence with the re- 
duction of total number of ID FFT operations and these 
reductions would not be observed in the direct appli- 
cation of a 3D FFT library for these cases with one or 
more free BC dimensions. Instead, in such a direct ap- 
plication, the computation times would slightly increase 
compared to the fully periodic case because of the addi- 
tional zero padding spread operations and correspond- 
ing compactifications. However, the developers of var- 
ious 3D FFT libraries may benefit from the techniques 
described in this work to increase the performance of 
their libraries for such zero padded input data. 

Comparison with CPU computations show that the 
customized GPU Poisson solver provides up to ~24.4 
times faster computations compared to single core CPU 
computation when data transfer times are not consid- 
ered. With the advent of new hardware which merges 
CPU and GPU on the same chip and with new technolo- 
gies like NVIDIA GPUDirect [ 16] which allows calling 
MPI subroutines between GPU memories, transfers to 
and from the GPU will be required less frequently and 
it will be possible to run a large fraction of sophisti- 
cated software packages on the GPU. Also, when mul- 
tiple solution of the Poisson's equation from different 
input densities are required (as it is the case, for exam- 
ple, in Quantum Chemistry or electronic structure cal- 
culations), the transfer time for a given density can be 
overlapped by the calculation of the previous one 
For these reasons we give also the speedups without the 
transfer times. Even when the transfer times are in- 
cluded, the customized GPU Poisson solver provides up 
to ~15.6 times faster computations compared to single 
core CPU. This means that the developed GPU Poisson 
solver can be used in real applications including free 
BCs to accelerate the Poisson solver part of the code. 
Figures [2] and [3] show that the performance gain from 
GPU computation reduces for small sizes and it is max- 
imal for large power of two sizes. Increase of the data 
transfer to computation time ratios with increasing num- 
ber of periodic dimensions is also evident in these Fig- 
ures. Therefore, when data transfer times are also con- 
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sidered, the amount of acceleration due to GPU usage 
increases as the number of free dimensions increases. 

The new developments are implemented in the 
BigDFT Electronic Structure Package version 1.7 and 
the Poisson solver part of the code is accelerated as 
shown in Figure [4] and Figure [5] for fully free BCs cal- 
culations. Up to ~14.9 times acceleration compared to 
single core CPU is observed in these tests including data 
transfer times. For surface and wire BCs, the compar- 
ison was not carried out since BigDFT code does not 
consider the optimal order of periodic and free dimen- 
sions for these mixed BCs cases at the moment. The 
source code of the BigDFT package can be accessed 
from the BigDFT project web page [ 17] under GPL Li- 
cense agreement. 

The convolution operation which is the main part 
of the Poisson solver discussed in this work is not 
only required in the context of solving Poisson's 
equation, but also in many other contexts. In plane 
wave density functional programs, the application of 
the local potential onto the wave function is for instance 
a convolution of the same type as the convolution in 
the solution of Poisson's equation with free BCs. Such 
programs have also been ported to GPU's II 1 811 but the 
possible speedup that could be obtained in this con- 
text by a customized FFT have not been exploited so far. 
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Figure 1: Comparison of the GPU computation times for different BCs on a NVIDIA Tesla M2090 GPU. Fully free, wire and surface BCs compu- 
tation times are shown as percentages of fully periodic BCs computation times which are given under corresponding column bars in milliseconds. 
The grid sizes given in the figure are FFT sizes (S x x S v X S z ) after the zero paddings in necessary dimensions. 
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Figure 4: Comparison of the GPU (NVIDIA Tesla M2090) computation times with BigDFT Poisson solver single core CPU (Intel 17 930) compu- 
tation times for fully free BCs. GPU and GPU + data transfer times are shown as percentages of single core CPU computation times for different 
grid sizes. Single core CPU computation times are given on each column bar in milliseconds. The grid sizes given in the figure are FFT sizes 
(S x xSyX S z ) after the zero paddings in all dimensions. 
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Figure 5: Comparison of the GPU (NVIDIA Tesla M2090) computation times with BigDFT Poisson solver 4 cores CPU (Intel 17 930) computation 
times for fully free BCs. GPU and GPU + data transfer times are shown as percentages of 4 core CPU computation times for different grid sizes. 4 
cores CPU computation times are given on each column bar in milliseconds. The grid sizes given in the figure are FFT sizes (S x X S y X S z ) after 
the zero paddings in all dimensions. 
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